Very dense breast tissue in a mammogram image often makes it difficult for
Radiologists to interpret the mammography objectively and accurately. One of
the key success factors of a computer-aided diagnosis (CADx) system is the use
of the right features. Therefore, this research emphasizes the feature
selection process by performing data mining on the results of mammogram image
feature extraction. Two algorithms are used to perform the mining: the decision
tree and rule induction. The features selected by these algorithms are then
tested using three classification algorithms (k-nearest neighbors, decision
tree, and naive Bayesian) under 10-fold cross validation with stratified
sampling. Five descriptors are the best features and contribute to classifying
lesions as benign or malignant: slice, integrated density, area fraction, modal
gray value, and center of mass. The best classification results based on these
five features are generated by the decision tree algorithm, with accuracy,
sensitivity, specificity, FPR, and TPR of 93.18%, 87.5%, 3.89%, 6.33%, and
92.11%, respectively.
1. INTRODUCTION

Breast cancer and cervical cancer are the types of cancer that cause the highest mortality among women in
Indonesia. Based on data in [1], there are 330,000 cancer patients in Indonesia, and the highest cancer
prevalence, 4.1%, is found in the Yogyakarta Special Region. Breast cancer has established risks (e.g. family
history, obesity, dense breasts) and emerging risks (e.g. low vitamin D levels, unhealthy lifestyle);
therefore, early detection can be conducted to reduce breast cancer mortality. In fact, if cancer is found at
an early stage, the cure rate is high. However, most cases of breast cancer in Indonesia are found at an
advanced stage because of low awareness. Mammography is one of the imaging technologies that can be used both
for screening and for the diagnosis of breast cancer. Based on the BI-RADS lexicon for mammography (2013), a
hyperdense mass with an irregular shape and spiculated margin is associated with malignancy. Other suspicious
morphologies are amorphous, coarse heterogeneous, fine pleomorphic, and fine linear or fine-linear branching
calcifications [2].

Computer-aided systems using mammogram images have been developed for several different purposes,
among others: to determine the level of breast cancer risk [3]; to detect locations considered abnormal in the
mammogram image, commonly called a CADe system [4]; and to diagnose the type of breast cancer in the RoI of
the mammogram image, commonly called a CADx system. A CADx system serves as a second opinion in diagnosing
breast cancer based on the reading of the mammogram image.

In general, developing a computer-aided diagnosis (CADx) system involves several stages, among
others: image acquisition, preprocessing, feature extraction, feature selection, classification and testing.
At each stage, the right algorithm must be chosen so that the system provides an accurate diagnosis. In
principle, a CADx system works in the same way as a pattern recognition system. One important factor that
determines the success or failure of a pattern recognition system is the use of the right features. According
to [5], feature selection is a critical stage because the right features make the pattern recognition system
capable of distinguishing one object from another in accordance with the characteristics of the object; one
example is a method based on improved document frequency for text classification [6]. Therefore, it is
necessary to select features of a mammogram that are able to distinguish benign from malignant lesions.

Several researchers have developed computer-aided systems for risk assessment, detection and
diagnosis of breast cancer using features found on the mammogram, including color [7], texture [8], [9],
shape [10] and a combination of the three [11]. The use of the right features greatly affects the performance
of a pattern recognition system. Computationally, it is desirable to use as few features as possible while
still being able to distinguish one class from another. Therefore, an algorithm is needed to choose the best
features among the many available. Previous studies have applied several algorithms to feature selection,
among others: the branch and bound algorithm [12], the hill climbing algorithm [13] and the multi-structure
co-occurrence descriptor [14]. However, these methods have not yet been applied specifically to select
features of mammogram images for the development of a CADx system for breast cancer.

This research proposes the use of data mining methods as feature selection algorithms for mammogram
images. The algorithms used are the decision tree and rule induction; the features selected by the two
algorithms are then classified using several classification algorithms to measure performance. In addition,
this research uses primary data in which the lesion types (benign and malignant) have been classified by
Radiologists, not only based on visual assessment but also verified against the results of laboratory tests
and an assessment using another imaging technology, namely ultrasound.


2. RESEARCH METHOD 
This research follows a six-stage process for developing a computer-based system for the diagnosis of
breast cancer, as follows:


2.1. Mammography Image Acquisition 

This research uses primary data in the form of mammogram images produced by digital mammography at
the Kotabaru Oncology Clinic, Yogyakarta. A total of 117 mammogram lesions were obtained from the subjects in
two views, CC (craniocaudal) and MLO (mediolateral oblique). The Radiologists, who in this case are also
members of the research team, then conducted a visual analysis of the mammograms. In assessing a mammogram
image, the Radiologists not only interpret the image itself but also match the interpretation against the
results of other examinations, in this case ultrasound imaging and pathology tests. This cross-check against
other test results is needed to provide valid annotations of the parts considered to be disorders or cancer,
hereinafter referred to as the RoI (Region of Interest). Besides annotating the RoI on the mammogram image,
the Radiologists classify it into two categories, benign lesions and malignant lesions. The 117 mammograms
consist of 79 benign lesions and 38 malignant lesions. Every image produced by the mammography device has the
same size, 2424x3296 pixels, but the cropped images, which correspond to the Radiologists' annotations, vary
widely in size because they depend on the extent of the RoI area.


2.2. Preprocessing

Interpreting a mammogram image is very difficult because the image produced by mammography has a
very low quality. One of its characteristics is a very low level of contrast, which makes it very difficult
to distinguish the RoI from the fatty tissue. Therefore, before feature extraction, the mammogram image
quality needs to be improved in a preprocessing stage that aims to obtain a better quality image. The
processes performed at this stage are: normalizing the mammogram image size to 256x256 pixels with bilinear
interpolation; removing the background of the mammogram image with a rolling ball radius of 50 pixels;
removing noise by median filtering with a radius of 2 pixels; improving the image contrast using the CLAHE
(Contrast-Limited Adaptive Histogram Equalization) method with a block size of 127, 256 histogram bins and a
maximum slope of 3; and further improving the contrast using histogram equalization with 0.4% saturated
pixels. The results of each preprocessing step are shown in Figure 1 and Figure 2.
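
The noise-removal step can be illustrated with a minimal pure-Python sketch of a median filter. It assumes a square window of (2 x radius + 1) pixels per side that is clipped at the image border; the software used in this research may use a different kernel shape, so this is an illustration rather than the exact procedure. The function name `median_filter` and the toy image are hypothetical.

```python
from statistics import median


def median_filter(image, radius=2):
    """Replace each pixel with the median of its square neighbourhood.

    `image` is a list of equal-length rows of gray values. The window is
    clipped at the border (an assumption made for this sketch).
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(median(window))
        result.append(row)
    return result


# Toy 5x5 patch with one impulse-noise pixel (value 255) in the centre.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255
smoothed = median_filter(noisy, radius=2)
```

In this toy patch the isolated bright pixel is replaced by the value of its neighbours, which is the behaviour that motivates median filtering before contrast enhancement.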


Figure 1. (a) Original image in CC view, (b) resized RoI, (c) background removal,
(d-f) the corresponding histograms

Figure 2. (a) Median filtering, (b) CLAHE, (c) histogram equalization, (d-f) the corresponding histograms


2.3. Feature Extraction

One key to the success of a pattern recognition system, which in this case is the CADx system, is the
use of the right features, namely those able to distinguish benign from malignant lesions. Therefore, it is
necessary to study algorithms that can select and evaluate features precisely. In general, this research
performs feature extraction on the mammogram image using two feature domains, the shape domain (14
descriptors) and the texture domain (24 descriptors), using the expressions shown in Table 1 and Table 2. The
texture features consist of first-order statistics and second-order statistics, commonly called GLCM (gray
level co-occurrence matrix) features. The GLCM features are computed in four directions (0°, 45°, 90° and
135°), together with the average value of each feature over the four directions.
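
As an illustration of the first-order statistics, the following sketch computes the mean, standard deviation, skewness and kurtosis from the normalized gray-level histogram p(k). It assumes 256 gray levels, a non-constant patch and the excess form of kurtosis (with 3 subtracted); the function name `first_order_stats` and the pixel values are hypothetical.

```python
from math import sqrt


def first_order_stats(pixels, levels=256):
    """Mean, standard deviation, skewness and excess kurtosis of gray values.

    Assumes the patch is not constant, so the standard deviation is non-zero.
    """
    n = len(pixels)
    # p[k] is the relative frequency of gray level k (normalised histogram).
    p = [0.0] * levels
    for value in pixels:
        p[value] += 1.0 / n
    mean = sum(k * p[k] for k in range(levels))
    variance = sum((k - mean) ** 2 * p[k] for k in range(levels))
    std = sqrt(variance)
    skewness = sum((k - mean) ** 3 * p[k] for k in range(levels)) / std ** 3
    kurtosis = sum((k - mean) ** 4 * p[k] for k in range(levels)) / std ** 4 - 3
    return mean, std, skewness, kurtosis


# Toy patch (hypothetical values), symmetric around gray level 100.
patch = [90, 100, 100, 110]
mean, std, skewness, kurtosis = first_order_stats(patch)
```

Because the toy patch is symmetric around its mean, its skewness is zero; an asymmetric histogram would give a positive or negative value.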


Table 1. Some features for shape domain

Feature             Expression
Area                Area of the selection in square pixels
Center of mass      The brightness-weighted average of the x and y coordinates of all pixels in the image or selection
Modal gray value    The highest peak in the histogram
Centroid            The average of the x and y coordinates of all of the pixels in the image or selection
Perimeter           The length of the outside boundary of the selection
Integrated density  The sum of the values of the pixels in the image or selection
Median              The median value of the pixels in the image or selection
Area fraction       For thresholded images, the percentage of pixels in the image or selection that have been
                    highlighted; for non-thresholded images, the percentage of non-zero pixels
Stack position      The position (slice, channel and frame) in the stack or hyperstack of the selection
Circularity         $\text{Circularity} = 4\pi \times \text{area} / \text{perimeter}^2$
Aspect ratio (AR)   The aspect ratio of the particle's fitted ellipse
Roundness           The inverse of the aspect ratio, $4 \times \text{area} / (\pi \times \text{major axis}^2)$
Solidity            $\text{Solidity} = \text{area} / \text{convex area}$

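Table 2. Some features for texture domain

Feature                        Expression
Mean ($\mu$)                   $\mu = \sum_{k=0}^{255} k \, p(k)$
Standard deviation ($\sigma$)  $\sigma^2 = \sum_{k=0}^{255} (k - \mu)^2 \, p(k)$
Skewness                       $\alpha_3 = \frac{1}{\sigma^3} \sum_{k=0}^{255} (k - \mu)^3 \, p(k)$
Kurtosis                       $\alpha_4 = \frac{1}{\sigma^4} \sum_{k=0}^{255} (k - \mu)^4 \, p(k) - 3$
Contrast (CONT)                $\mathrm{CONT}_{d,\theta} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} (i - j)^2 \, P_{d,\theta}(i, j)$
Correlation (CORR)             $\mathrm{CORR}_{d,\theta} = \frac{\sum_{i=0}^{L-1} \sum_{j=0}^{L-1} i \times j \times P_{d,\theta}(i, j) - \mu_x \times \mu_y}{\sigma_x \times \sigma_y}$
Energy                         $\mathrm{Energy}_{d,\theta} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} P_{d,\theta}(i, j)^2$
Homogeneity (HOMO)             $\mathrm{HOMO}_{d,\theta} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} \frac{1}{1 + (i - j)^2} \, P_{d,\theta}(i, j)$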
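
The second-order (GLCM) features can be sketched in a few lines of pure Python. The sketch below assumes a pixel distance of d = 1, non-symmetric co-occurrence counts and a toy image with only four gray levels; none of these settings are stated in the text, and the names `glcm`, `contrast`, `energy`, `homogeneity` and `OFFSETS` are hypothetical.

```python
def glcm(image, dx, dy, levels):
    """Normalised gray level co-occurrence matrix for the offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    total = 0
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                counts[image[y][x]][image[ny][nx]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]


def contrast(p):
    size = len(p)
    return sum((i - j) ** 2 * p[i][j] for i in range(size) for j in range(size))


def energy(p):
    return sum(v ** 2 for row in p for v in row)


def homogeneity(p):
    size = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(size) for j in range(size))


# Offsets (dx, dy) at distance 1 for the four directions; rows grow downwards.
OFFSETS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
contrast_by_angle = {
    angle: contrast(glcm(image, dx, dy, levels=4))
    for angle, (dx, dy) in OFFSETS.items()
}
# Average over the four directions, as described for each GLCM feature.
mean_contrast = sum(contrast_by_angle.values()) / len(contrast_by_angle)
```

Each direction gives one value per feature (for example contrast_0 or contrast_135), and the average over the four directions gives one more.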


After preprocessing, the shape and texture features are extracted from the mammogram image. Three
experimental settings (nodes) are defined: first, using all 38 shape and texture descriptors together;
second, using the 14 shape descriptors; and third, using the 24 texture descriptors. The scheme of feature
use for the three nodes is shown in Table 3.

Table 3. The use of feature descriptors in the experiments

Node  Feature                     Number of descriptors
1     Shape and texture features  38
2     Shape features              14
3     Texture features            24


2.4. Feature Selection 

Proper use of features may provide optimal classification results and, computationally, may also
reduce the processing load spent on unimportant data. Therefore, in this research, data mining is conducted
on the results of feature extraction for the three nodes noted in Table 3. Two algorithms are used to perform
the feature selection: the decision tree and rule induction. The decision tree is a powerful and popular
algorithm for classification and prediction. Another advantage is that it represents rules that are easily
understood by humans, and the resulting knowledge can be stored as data in a database [16]. Rule induction is
a machine learning algorithm that formulates rules extracted from a collection of observed data. The
extracted rules form a data model that represents patterns in the data [17]. An example of the use of the
decision tree and rule induction for the first node, with 38 descriptors, is shown in Figure 3 and Table 4.
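
The idea of using a decision tree for feature selection, where only descriptors that yield useful splits are kept, can be sketched as follows. The sketch assumes information gain as the split criterion, which the text does not specify, and uses made-up feature values that are not taken from the 117 lesions; the names `entropy` and `best_split_gain` are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels)
    )


def best_split_gain(values, labels):
    """Largest information gain over all thresholds 'value <= t' of one feature."""
    base = entropy(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = max(best, base - weighted)
    return best


# Hypothetical toy data, for illustration only.
samples = {
    "area_fraction": [10, 12, 11, 30, 32, 31],
    "perimeter": [5, 9, 6, 8, 5, 9],
}
labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

# Rank the descriptors by how well a single threshold separates the classes.
ranking = sorted(
    samples, key=lambda name: best_split_gain(samples[name], labels), reverse=True
)
```

A descriptor that never produces a worthwhile split, like `perimeter` in this toy data, does not appear in the resulting tree and is therefore not selected.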

Several important features are obtained from the results of mining the 38 descriptors of the
mammogram images using the decision tree and rule induction. The important features generated by the decision
tree algorithm (see Table 5, scenario I) are kurtosis, area fraction and mean, while those generated by the
rule induction algorithm (see Table 5, scenario II) are slice, mean, area fraction and contrast at an angle
of 135°. The same procedure is applied to nodes 2 and 3 using the decision tree and rule induction
algorithms; the mining results are shown in Table 5 (scenarios III and IV) for node 2 and Table 5 (scenarios
V and VI) for node 3. The features used in scenario VII are the important features generated for the first
node by both the decision tree algorithm and the rule induction algorithm. The same applies to scenarios VIII
and IX, which use the best features from the experimental results of nodes 2 and 3. The detailed results of
the experiments can be seen in Table 5.

Figure 3. Graph of the result of the decision tree algorithm


Table 4. The rules generated by rule induction for the classification of benign and malignant lesions using
38 descriptors

Node  Rule
1     if Slice > 38500 then Benign (0/41)
2     if Mean > 109600 and AreaFraction < 99730 then Malignant (19/0)
3     if AreaFraction < 99785 and Contrast_135 < 0.373 then Benign (3/30) else Malignant (14/7)
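
The rules of Table 4 form an ordered list in which the first matching rule decides the class. A minimal sketch of applying them is given below; the thresholds are copied from Table 4, while the function name `classify` and the example feature vector are hypothetical.

```python
def classify(lesion):
    """Apply the three ordered rules of Table 4; the first match decides."""
    if lesion["Slice"] > 38500:
        return "Benign"
    if lesion["Mean"] > 109600 and lesion["AreaFraction"] < 99730:
        return "Malignant"
    if lesion["AreaFraction"] < 99785 and lesion["Contrast_135"] < 0.373:
        return "Benign"
    # The 'else' branch of the last rule.
    return "Malignant"


# Hypothetical feature vector, chosen only to exercise the second rule.
example = {"Slice": 1000, "Mean": 120000, "AreaFraction": 99000, "Contrast_135": 0.5}
prediction = classify(example)
```

Only four descriptors (Slice, Mean, AreaFraction and Contrast_135) appear in these rules, which is why they are reported as the features selected by rule induction for the first node.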

Table 5. The selected features resulting from the experiment scenarios using the decision tree and rule
induction algorithms

Scenario  Descriptors  Decision tree
I         3            Kurtosis, area fraction, mean
III       6            Area fraction, median, integrated density, slice, modal gray value, center of mass
V         2            Kurtosis, mean
VII       2            Area fraction, mean
IX        2            Kurtosis, mean

Scenario  Descriptors  Rule induction
II        4            Slice, mean, area fraction, contrast_135
IV        5            Slice, integrated density, area fraction, modal gray value, center of mass
VI        5            Mean, kurtosis, contrast_0, energy_45, correlation_45
VIII      5            Slice, integrated density, area fraction, modal gray value, center of mass
X         8            Slice, mean, area fraction, contrast_135, integrated density, modal gray value,
                       center of mass, kurtosis



2.5. Classification 

Having obtained the selected features for each scenario based on the decision tree and rule
induction algorithms, the mammogram images are classified into two classes, benign lesions and malignant
lesions. In this classification stage, several algorithms are used, namely k-nearest neighbors (KNN),
decision tree (DT) and naive Bayesian (NB), which are discussed further in the results. Based on the
preceding feature selection process, classification is carried out on the ten predefined scenarios to
measure performance.


2.6. Evaluation 

To evaluate the classification results for the selected features in each scenario, the data is
divided automatically using k-fold cross validation (k = 10) with stratified sampling. In addition, this
research uses five statistical parameters that are commonly used in testing medical diagnostic results:
accuracy, sensitivity, specificity, false positive rate (FPR) and true positive rate (TPR). The aim of using
the five parameters is to determine how reliably and consistently a system diagnoses breast cancer. Accuracy
is the proportion of data predicted correctly by the classification system, whether negative or positive;
sensitivity measures the success of the classification system in identifying positive data correctly, and
specificity measures its success in identifying negative data correctly. FPR is the proportion of negative
cases wrongly identified as positive, and TPR is the proportion of positive cases correctly identified as
positive. The relationship between FPR and TPR can be represented graphically as the ROC curve. The ROC
curve assists in deciding on the best model for the diagnosis of breast cancer. The calculation of the five
parameters is shown in Table 6.
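
Stratified sampling means that each of the 10 folds keeps roughly the same benign-to-malignant ratio as the full data set. The following sketch shows one simple way to build such folds using the class sizes from Section 2.1; the round-robin assignment, the fixed seed and the name `stratified_folds` are assumptions of this sketch and not necessarily what the tool used in this research does.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, preserving class ratios."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds,
        # continuing where the previous class stopped.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


# Class sizes from Section 2.1: 79 benign and 38 malignant lesions.
labels = ["benign"] * 79 + ["malignant"] * 38
folds = stratified_folds(labels, k=10)
fold_sizes = [len(fold) for fold in folds]
malignant_per_fold = [sum(labels[i] == "malignant" for i in fold) for fold in folds]
```

In each round, one fold is held out for testing and the other nine are used for training, so every lesion is tested exactly once.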

Notation: TP (true positive); TN (true negative); FN (false negative); FP (false positive); FPR (false
positive rate) and TPR (true positive rate).

A general description of each stage is shown in Figure 4.


Figure 4. General description of research stages (mammography image acquisition, preprocessing, feature
extraction, feature selection with decision tree and rule induction, classification, evaluation)


Table 6. Formulas to obtain the values of sensitivity, specificity, accuracy, FPR and TPR

No  Formula
1   Sensitivity = TP / (TP + FN)
2   Specificity = TN / (TN + FP)
3   Accuracy = (TN + TP) / (TN + TP + FN + FP)
4   FPR = FP / (FP + TN)
5   TPR = TP / (TP + FN)
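
The formulas of Table 6 can be evaluated directly from the four confusion-matrix counts, as in the sketch below. The counts are hypothetical: they are not reported in this research and were chosen only so that, for 38 malignant (positive) and 79 benign (negative) lesions, TPR and FPR round to 92.11% and 6.33%. The function name `diagnostic_metrics` is likewise hypothetical.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Evaluate the five formulas of Table 6 from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "fpr": fp / (fp + tn),
        "tpr": tp / (tp + fn),
    }


# Hypothetical counts for 38 positive and 79 negative cases (illustrative only).
metrics = diagnostic_metrics(tp=35, tn=74, fp=5, fn=3)
percentages = {name: round(value * 100, 2) for name, value in metrics.items()}
```

By these formulas, sensitivity is identical to TPR, and specificity equals 1 - FPR.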


3. RESULTS AND ANALYSIS 

The purpose of this research is to find the best features for developing a CADx system for breast
cancer on mammogram images. Therefore, several experiments were conducted with ten scenarios, each consisting
of the six research stages shown in Figure 4. Taking the first scenario as an example, after image
acquisition using mammography, the preprocessing steps described in detail in Section 2.2 are applied. The
output of this stage is a mammogram image of better quality, so that the Radiologists can visually
differentiate the RoI from the fatty tissue; previously it was very difficult to distinguish these two areas
because the tissue is very thin and the intensities do not differ much. The next stage is the feature
extraction of 38 descriptors (a combination of shape and texture features); the extracted features are then
selected using a decision tree (scenario I). The mining process using the decision tree shows that not all
features contribute to determining the class of breast cancer (benign or malignant). Only three descriptors
contribute, as shown in Table 5. The next stage is the classification of mammogram lesions into two classes
(benign and malignant) using the k-nearest neighbors (KNN), decision tree (DT) and naive Bayesian (NB)
algorithms. Classification is performed on the features of each scenario using the three classification
algorithms and evaluated using 10-fold cross validation. The complete evaluation results are shown in
Table 7. The highest accuracy of the CADx system in classifying benign and malignant lesions, 93.18%, is
obtained in scenarios IV and VIII (using the five descriptors shown in Table 5) with the decision tree
classification algorithm. The use of the five descriptors also provides FPR, TPR, precision and recall values
of 6%, 92%, 88% and 92%, respectively, while the rules that can be used to classify both types of lesion are
shown in Table 8.


Table 7. Evaluation results of the use of selected features (based on the decision tree and rule induction
algorithms) for each scenario

Classification Method  Acc (%)  Sens (%)  Spec (%)  FPR (%)  TPR (%)
SCENARIO I
KNN                    76.29    63.16     17.72     17.72    63.16
DT                     82.05    72.97     13.75     12.66    71.05
NB                     60.83    44.74     9.76      53.16    89.47
SCENARIO II
KNN                    76.29    62.5      16.88     18.99    65.79
DT                     88.26    76.09     4.23      13.92    92.11
NB                     77.05    59.02     3.57      31.65    94.74
SCENARIO III
KNN                    71.36    56.25     23.53     17.72    47.37
DT                     90.61    81.39     4.05      10.13    92.11
NB                     77.12    59.65     6.67      29.11    89.47
SCENARIO IV and VIII
KNN                    71.36    56.25     23.53     17.72    47.37
DT                     93.18    87.5      3.89      6.33     92.11
NB                     75.3     57.63     6.89      31.65    89.47
SCENARIO V and IX
KNN                    77.27    64.87     17.5      16.46    63.16
DT                     76.06    72.73     23.16     7.59     42.11
NB                     70.15    52.83     15.63     31.65    73.68
SCENARIO VI
KNN                    77.27    64.86     17.5      16.46    63.16
DT                     75.23    68        22.83     10.13    44.74
NB                     62.5     45.71     12.77     48.10    84.21
SCENARIO VII
KNN                    79.77    67.5      14.29     16.46    71.05
DT                     83.03    72.5      11.69     13.92    76.32
NB                     61.74    45.57     5.26      54.43    85.71
SCENARIO X
KNN                    71.36    235       3.2       17.72    47.37
DT                     84.55    77.78     0.52      10.13    73.68
NB                     76.21    58.62     0.3       30.38    89.47


Table 8. The rules generated by rule induction for the classification of benign and malignant lesions using
five descriptors (selected features of scenarios IV and VIII)

Node  Rule
1     if Slice > 38500 then Benign (0/41)
2     if IntegratedDensity > 7171500000 and AreaFraction < 99730 then Malignant (19/0)
3     if AreaFraction < 99785 and AreaFraction > 99705 then Benign (0/18)
4     if AreaFraction < 99695 and AreaFraction > 99610 then Benign (0/13)
5     if ModalGrayValue > 17500 then Malignant (18/2)
6     if CenterOfMass < 131650 then Benign (0/5)
7     else Malignant (0/0)


The classification algorithms (KNN, DT and NB) are used for the development of the CADx system with
the selected features (five descriptors); testing is performed with 10-fold cross validation and visualized
using the ROC curve shown in Figure 5. The DT algorithm achieves the best performance of the three
algorithms.


Figure 5. The testing results using the features slice, integrated density, area fraction, modal gray value
and center of mass

4. CONCLUSION

Based on the results of the experiments and tests, it can be concluded that the decision tree and
rule induction, which are data mining algorithms, can be used as an alternative method for selecting features
of mammogram images. The feature selection conducted with ten scenarios yields five descriptors that
contribute to classifying mammogram images into two classes, benign lesions and malignant lesions. The best
classification results based on the five features (slice, integrated density, area fraction, modal gray
value, center of mass) are generated by the decision tree algorithm, with accuracy, sensitivity, specificity,
FPR and TPR of 93.18%, 87.5%, 3.89%, 6.33% and 92.11%, respectively.